Before I begin plotting the data, I want to first figure out a couple of things about the variables. First, how many of each quality are there?
summary(wf)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
table(wf$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
There are 4898 observations of 12 variables. Each observation (or row) consists of 11 variables describing various chemical and physical aspects of a wine, plus a quality score: the median of the ratings given by judges of that wine, with 0 being the lowest rating and 10 being the highest.
Quality is the feature of interest - the goal of this analysis is to explore which other features of the data explain the quality of the wines.
From what I have read in the readme for the dataset, I am expecting levels of sulfur dioxide to play a part in determining quality - it seems like there ought to be a balance of sulfur dioxide. Too much will cause a bad, sulfurous odor, while too little may leave the wine less fresh. Beyond that, my non-existent knowledge of wine would have me expect that sugar levels, alcohol content, and salt content would all have some sort of effect on quality, though in what way I really have no idea at this point. I also expect acidity to be a factor in determining quality.
As you can see, there are no wines with ratings of 0, 1, 2, or 10. There are only 5 wines with ratings of 9 and 20 with ratings of 3. This seems like a good indication that I can group some of these ratings together into buckets: “high”, “medium-high”, “medium”, “medium-low”, and “low”. I’ll do this by adding a new variable: quality_level. This will let me use geom_freqpoly and facet_wrap more effectively, since I won’t have one category with only 5 observations in it and another category with over 2000. Low is 3 and 4, medium-low is 5, medium is 6, medium-high is 7, and high is 8 and 9.
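The bucketing described above could be done with cut(). This is only a minimal sketch and not necessarily the code behind this report: it rebuilds the quality scores from the counts printed by table(wf$quality) so that it runs without the data file, and the break points are my reading of the bucket definitions.

```r
# Rebuild the quality scores from the table(wf$quality) counts shown earlier,
# so the example runs without the data file.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# cut() uses right-closed intervals by default:
# (2,4] (4,5] (5,6] (6,7] (7,9]
quality_level <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("low", "medium-low", "medium", "medium-high", "high"),
  ordered_result = TRUE
)

# Number of wines in each bucket
table(quality_level)
```

With the real data frame, the same call would be applied to wf$quality and the result stored as wf$quality_level.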
Here, I aggregate sugar, pH, alcohol, and acidity by quality and plot the results to see if anything pops out. Note that I also create a new feature called ‘total.acidity’, which is just the sum of fixed acidity, volatile acidity, and citric acid.
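As a small illustration of the total.acidity feature, here is a sketch on a toy data frame. The name wf_toy is hypothetical, and its rows are assembled from the median and quartile values in the summary above, so they are not actual observations.

```r
# Toy stand-in for wf: rows assembled from summary statistics,
# NOT actual observations from the data set.
wf_toy <- data.frame(
  fixed.acidity    = c(6.8, 7.3, 6.3),
  volatile.acidity = c(0.26, 0.32, 0.21),
  citric.acid      = c(0.32, 0.39, 0.27)
)

# total.acidity is the row-wise sum of the three acidity columns.
wf_toy$total.acidity <- with(
  wf_toy,
  fixed.acidity + volatile.acidity + citric.acid
)

wf_toy
```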
I really wish the data for quality were more continuous - on a range of 0 to 10 based on an average of all the judges’ scores for that particular wine. But alas, the data gods are not so kind. Judging from the fitted lines, higher quality wines generally have less fixed and volatile acidity than lower quality wines, while they tend to have more citric acid. In terms of general acidity, higher quality wines tend to have higher pH values, which makes sense: a higher pH means a less acidic wine, which is in line with the general trend that total acidity goes down as wine quality goes up. Still, these lines are not very strong fits - for volatile acidity, it seems that a parabolic curve would be a better fit.
These graphs show that the correlation between acidity and pH is slightly less strong, but it still exists. Higher wine quality seems to predict lower acidity.
Alcohol content and wine quality are pretty closely correlated. We can use this in our model.
What seems interesting here is that while plotting sugar on its own against quality does not show much of a correlation, plotting residual sugar against alcohol and then coloring by quality seems to show that higher quality wines, which tend to have higher alcohol contents, also tend to have lower sugar levels than wines with lower alcohol contents. Plotting sugar together with alcohol content made the relationship of both features with quality much easier to see.
As expected, both alcohol content and residual sugar are highly correlated with density. If we were to create a linear regression model for quality, we should avoid having all three of these variables in the model, as multicollinearity would become a significant problem.
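One common way to quantify this kind of concern is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R-squared). The sketch below is an illustration on simulated data, not the wine data; the helper vif_manual and the columns x1, x2, x3 are hypothetical names, with x3 built from the other two to mimic how density depends on alcohol and residual sugar.

```r
# Simulated stand-in data (NOT the wine data): x3 is constructed from
# x1 and x2, so the three columns are strongly collinear.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.8 * x2 - 0.6 * x1 + rnorm(n, sd = 0.2)
sim <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF of each column, obtained by regressing
# that column on all the other columns in the data frame.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_manual(sim)                 # all three columns together
vif_manual(sim[c("x1", "x2")])  # after dropping the derived column
```

A VIF near 1 means a predictor is nearly independent of the others. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though that threshold is a convention rather than a hard rule.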
Immediately we see that there is a strong likelihood that salt content is correlated with quality. We will further investigate.
These plots confirm that lower salt content is correlated with higher wine quality. What is interesting is the inverse correlation between alcohol and chlorides, which I would not have expected. It seems that there are no wines with low alcohol content and low chloride levels and no wines with high alcohol content and high chloride levels. I am not sure why that is - perhaps it is a side effect of making wine with high alcohol content, or that high quality wines are produced with the goal of high alcohol content and low salt content in mind. Regardless, they are correlated, so we should bear that in mind while constructing a model so as to keep multicollinearity at a minimum.
These do not tell us all that much. Let us investigate further. We will move directly into plotting these features against other possible explanatory features to see if any unexpected results show up.
It becomes clearer here that lower total sulfur dioxide seems to be correlated with higher wine quality. And, as expected, free sulfur dioxide and total sulfur dioxide are correlated with each other.
First, let us talk about the features other than the feature of interest that are correlated with each other. Some were obvious and expected, others were not:
- alcohol and density
- sugar and density
- chlorides and alcohol
- total sulfur dioxide and free sulfur dioxide
- all the various acidities with pH
Now, let us list how these features relate to quality:
- higher alcohol and higher quality
- higher citric acid and higher quality
- lower sugar and higher quality
- lower total sulfur dioxide and higher quality
- higher pH (lower overall acidity) and higher quality
- lower salt content and higher quality
With this information, we can improve on our expectations of what makes for a high quality wine. Good wines tend to have higher alcohol contents, fruitier flavor (due to higher citric acid content), lower sugar levels, lower salt levels, lower sulfur dioxide levels, and lower overall acidity. I have left out features such as density, which is too strongly correlated with more important features such as alcohol content and residual sugar, and sulphates, which does not seem to be correlated with quality and is only very slightly correlated with total sulfur dioxide.
## Calls:
## m1: lm(formula = alcohol ~ quality, data = wf)
## m2: lm(formula = alcohol ~ quality + citric.acid, data = wf)
## m3: lm(formula = alcohol ~ quality + citric.acid + chlorides, data = wf)
## m4: lm(formula = alcohol ~ quality + citric.acid + chlorides + pH,
## data = wf)
## m5: lm(formula = alcohol ~ quality + citric.acid + chlorides + pH +
## residual.sugar, data = wf)
## m6: lm(formula = alcohol ~ quality + citric.acid + chlorides + pH +
## residual.sugar + total.sulfur.dioxide, data = wf)
##
## ====================================================================================================
##                       m1           m2           m3           m4           m5           m6
## ----------------------------------------------------------------------------------------------------
## (Intercept)           6.957***     7.206***     8.284***     6.872***     9.358***     9.510***
##                       (0.106)      (0.115)      (0.120)      (0.345)      (0.316)      (0.305)
## quality               0.605***     0.604***     0.524***     0.518***     0.480***     0.442***
##                       (0.018)      (0.018)      (0.017)      (0.017)      (0.016)      (0.015)
## citric.acid                        -0.729***    -0.413***    -0.327**     -0.087       0.112
##                                    (0.130)      (0.125)      (0.127)      (0.113)      (0.110)
## chlorides                                       -15.566***   -15.400***   -14.243***   -12.450***
##                                                 (0.710)      (0.710)      (0.634)      (0.619)
## pH                                                           0.443***     -0.115       0.102
##                                                              (0.102)      (0.092)      (0.089)
## residual.sugar                                                            -0.096***    -0.075***
##                                                                           (0.003)      (0.003)
## total.sulfur.dioxide                                                                   -0.007***
##                                                                                        (0.000)
## ----------------------------------------------------------------------------------------------------
## R-squared             0.190        0.195        0.267        0.270        0.419        0.459
## adj. R-squared        0.190        0.195        0.266        0.269        0.418        0.459
## sigma                 1.108        1.104        1.054        1.052        0.938        0.905
## F                     1146.395     592.379      593.952      451.873      705.603      692.930
## p                     0.000        0.000        0.000        0.000        0.000        0.000
## Log-likelihood        -7450.661    -7435.065    -7205.503    -7195.982    -6636.054    -6459.235
## Deviance              6009.118     5970.970     5436.702     5415.605     4308.757     4008.627
## AIC                   14907.323    14878.130    14421.007    14403.964    13286.109    12934.469
## BIC                   14926.812    14904.116    14453.490    14442.943    13331.585    12986.442
## N                     4898         4898         4898         4898         4898         4898
## ====================================================================================================
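In the end, using just a pretty basic linear model, we get an R-squared of 0.459 (model m6), which is not too shabby. Of course, this is far from a perfect model - a linear regression simply cannot capture all the subtleties of the data. Note that in the calls above alcohol is the response and quality is one of the predictors, so this R-squared measures how much of the variation in alcohol content these features explain. I also included both chlorides and alcohol in the model, even though I already know that they are correlated with each other, and several of the predictors are correlated among themselves. Thus, there is likely some degree of multicollinearity that reduces the reliability of the coefficient estimates.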
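Nested models like m1 through m6 can be built one term at a time with update(). The sketch below shows the pattern on simulated data, not the wine data; the data frame d and the columns y, a, and b are hypothetical. It uses base R only, whereas the comparison table above appears to come from memisc's mtable().

```r
# Simulated stand-in data (NOT the wine data); y, a, b are hypothetical.
set.seed(42)
n <- 300
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$y <- 1 + 0.5 * d$a - 0.8 * d$b + rnorm(n)

# Build the models incrementally: each one adds a term to the previous.
m1 <- lm(y ~ a, data = d)
m2 <- update(m1, ~ . + b)

# R-squared never decreases when a predictor is added, so also look at
# adjusted R-squared and AIC, which penalise extra terms.
sapply(list(m1 = m1, m2 = m2), function(m) {
  c(r.squared     = summary(m)$r.squared,
    adj.r.squared = summary(m)$adj.r.squared,
    AIC           = AIC(m))
})
```

Comparing adjusted R-squared and AIC alongside plain R-squared is the same reading one would apply to the table above, where each added term lowers the AIC.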
This plot shows a clear trend towards higher alcohol content as wine quality increases. Just having a higher alcohol content seems to be a huge factor in determining wine quality - the entire boxplot moves up for each increase in quality level, which is not something I would have expected. It really makes me wonder why exactly alcohol is so strongly correlated with wine quality, and whether that bears out in real life. This plot sparked much of the exploration in regard to whether other features were strongly correlated with alcohol - is higher alcohol content a result of a general higher-quality wine making process, or is it purposefully sought after in the wine making process? I spent much of my time trying to explore this angle in this report.
I selected these first two plots because they reveal quirks of the data that you wouldn’t have been able to see otherwise. During my EDA, it was hard to see whether sugar content related at all to wine quality - different levels of sugar content seemed to be distributed quite evenly across all wine qualities. However, this plot immediately reveals two things: (1) higher quality wine does, in fact, have lower sugar levels, and (2) there are no wines with both high alcohol and high sugar content. The insights this plot offered me meant I was now willing to use residual sugar as a feature in the linear regression model I hoped to build, since it was clearly correlated with wine quality. And this paid off - adding residual.sugar to my linear model raised the R-squared value (unadjusted) from 0.270 to 0.419.
This reveals a relationship between features that I had not expected at all. For some reason, alcohol content seems to be inversely correlated with salt content - and high quality wines are overwhelmingly concentrated in the area of the plot where salt content is low and alcohol percentage is high.
This project was intimidating at first because there were so many features. Which ones should I concentrate on? Which ones would actually have any effect? And once I plotted the distributions of each with regard to quality, I did not come out as enlightened as I had thought I would be - only alcohol and perhaps salt seemed to contribute in any way to wine quality. This was unlike the diamond data set, in which features were fewer and there were universally defined metrics for what made a better diamond.
Still, there were a few common sense hunches that I had regarding what would affect wine quality - I feel that oftentimes, our own intuition is where we begin in such investigations, and in the process of confirming or invalidating those intuitions, we discover new quirks and trends that would not have occurred to us without such exploration. That is what happened with me - I felt that sugar levels and acidity ought to have some significant effect on wine quality.
After trying various plots with sugar levels, I was about ready to give up. There seemed to be no rhyme nor reason with sugar content across different quality wines. However, when I finally plotted sugar vs. alcohol content and colored the points by quality, sugar’s inverse relationship with wine quality finally revealed itself. Needless to say, I was pleased. However, this plot also revealed sugar’s inverse correlation with alcohol, which made me wonder why exactly would there be a relationship between alcohol and sugar? Is it because of the fermentation process that converts sugar into ethanol, and therefore the higher the alcohol content, the lower the sugar level?
This induced me to investigate further the relationship between alcohol and other features, and I found, to my surprise, that chlorides and alcohol were also inversely correlated. Wines with high alcohol contents also had low salt contents and were generally rated higher than wines with low alcohol contents and high salt contents, which were generally rated lower. In fact, there seemed to be a relative dearth of wines that had both high alcohol and high salt contents as well as both low alcohol and low salt contents. This raises the same question that the discovery of sugar’s relationship with alcohol evoked: was this a result of the wine making process that naturally meant high quality wines had high alcohol contents and low salt contents, or was this due to wine makers purposely choosing to make wines with these characteristics? I do not think this is a question that can be answered with EDA alone - it would require an understanding of the wine making process as well.
Once I got the ball rolling in mixing and matching features to see if anything strange and interesting popped out, it was a relatively straightforward process to see how acidity related to wine quality. Strangely enough, it turned out that higher citric acid was correlated with higher wine quality even though overall acidity (as measured by pH and my total acidity variable) was correlated with lower wine quality. I attributed this to higher citric acid levels making wines taste fruitier. Also, total acidity was dominated by fixed acidity - citric acid was a small enough component of total acidity that its level was nearly negligible in determining pH, so this was actually a finding that made sense. Too acidic a wine probably tastes bad, but fruitier wine tastes better.
There are still many things that could be done. There are some combinations of features that I have not plotted - namely, sulphates and sulfur dioxide levels against density - and I have not checked whether they would change anything in my analysis. Perhaps using more boxplots would also reveal some interesting things.
Also, if I were to spend more time on this, I would likely create more robust models for predicting wine quality - using naive Bayes, or vector models, or a logistic regression. My linear model had decent results, but it is far from the best model possible.
I would like to actually compare white wines with red wines - there would probably be a lot of interesting insights into the character of these two wines, in terms of their various acidities, alcohol contents, sugar levels, etc., and what makes for a high quality red or white wine.